
Add multi-algorithm deterministic cuDNN convolutions #34951

Conversation

duncanriach (Contributor) commented Dec 9, 2019

This pull request addresses a bug in the functionality of the environment variable TF_CUDNN_DETERMINISTIC (see PR 24747) and of the environment variable TF_DETERMINISTIC_OPS (see PR 31465).

The current implementation (before this pull request) of deterministic cuDNN convolution in TensorFlow chooses, for any layer configuration, one fixed deterministic algorithm for each of the forward and two backward propagation paths.

I have since come to appreciate that no single algorithm is guaranteed to work on all layer configurations. The solution in this PR addresses that problem. It uses the existing auto-tuning mechanism to attempt all of the available deterministic algorithms and then, instead of choosing the fastest one (as regular auto-tuning does), chooses the first deterministic algorithm that works. It does this in a deterministic way, based on each algorithm's index in an intentionally ordered array.
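
For illustration, here is a minimal sketch of that selection rule (hypothetical types and names, not the actual TensorFlow code): filter the autotuning results down to deterministic algorithms that ran successfully, then pick the one with the lowest index in the intentionally ordered list rather than the fastest one.

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-in for an autotuning profile entry.
struct ProfileResult {
  int algorithm_index;    // position in the intentionally ordered algorithm list
  bool is_deterministic;  // whether the algorithm is bitwise-reproducible
  bool succeeded;         // whether the algorithm ran on this layer configuration
  float elapsed_ms;       // used by regular autotuning (fastest wins); unused here
};

// Regular autotuning returns the fastest successful result. In deterministic
// mode we instead return the successful deterministic result with the lowest
// index, which is the same choice on every run for a given layer configuration.
std::optional<ProfileResult> PickFirstDeterministic(
    const std::vector<ProfileResult>& results) {
  std::optional<ProfileResult> chosen;
  for (const ProfileResult& r : results) {
    if (!r.succeeded || !r.is_deterministic) continue;
    if (!chosen || r.algorithm_index < chosen->algorithm_index) chosen = r;
  }
  return chosen;  // std::nullopt if no deterministic algorithm works
}
```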

@tensorflow-bot tensorflow-bot bot added the size:L CL Change Size: Large label Dec 9, 2019
@gbaned gbaned self-assigned this Dec 9, 2019
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Dec 9, 2019
@gbaned gbaned requested a review from timshen91 December 9, 2019 04:56

// A helper function to decide whether to enable deterministic cuDNN
// functionality.
bool RequireCuDNNDeterminism();
timshen91 (Member) commented Dec 9, 2019

It seems weird for TF and XLA to ask for flags from StreamExecutor; StreamExecutor is just a library, and it shouldn't impose constraints on users about how to use it.

Ideally, the source of truth for "users require determinism" should be a flag in TF, which then gets passed down to XLA; both TF and XLA would pass the same boolean down to StreamExecutor. But this ideal solution creates an unnecessary amount of work.

How about this:

  • Create a TF flag TF_DETERMINISTIC_AUTOTUNING in tensorflow/core/kernels/gpu_utils.h, which is only used by TF kernels.
  • Create a similar XLA flag to DebugOptions in tensorflow/compiler/xla/xla.proto, which is used by XLA.
  • Keep the current StreamExecutor flags as they are, privately in cuda_dnn.cc.

When the end user wants determinism, they have to compose all necessary flags. This way, at least the layering is less coupled.

duncanriach (Contributor, Author)

Hi Tim,

Adding and Modifying Switches

I strongly prefer not to add additional determinism switches or change the functionality of the existing switches. TF_CUDNN_DETERMINISTIC makes cuDNN-based ops function deterministically, and TF_DETERMINISTIC_OPS aspires to make all ops, including those based on cuDNN, operate deterministically. Since a model either operates deterministically or it doesn't (there being no useful gray area), TF_DETERMINISTIC_OPS supersedes, and ultimately makes fully redundant, TF_CUDNN_DETERMINISTIC.

At the moment TF_DETERMINISTIC_OPS is being listened to, like a global variable, in both Python code and C++ code. I'm currently working on a deterministic mode for another op, which will also be enabled by TF_DETERMINISTIC_OPS. In time, TF_DETERMINISTIC_OPS will be referenced in more and more places in the code-base. I envision the functionality of TF_DETERMINISTIC_OPS eventually being replaced by a fully plumbed-in config variable.

I have been communicating the status and recipes for TensorFlow determinism here, and I don't want to make that information unnecessarily complicated for the user. Note that I intend to document TensorFlow determinism at a high level in the official TensorFlow documentation soon.

Ideal Solution

I'm not sure what you mean by "a flag in TF." Perhaps you're referring to adding something to tf.config? Or are you just referring to what is represented by RequireCuDNNDeterminism() (the logical OR of TF_CUDNN_DETERMINISTIC with TF_DETERMINISTIC_OPS)?

Following your outline above, I'm thinking that a practical way forward might be to move RequireCuDNNDeterminism() from tensorflow/stream_executor/cuda/cuda_helpers.h/.cc to tensorflow/core/kernels/gpu_utils.h/.cc and use it where needed in tensorflow/core/kernels/* including passing it down in the interface to the GetConvolve*Algorithms() methods in the stream executor.

I currently don't know how to pass this flag down to XLA, but I could work on trying to figure it out. Then XLA could also use it and pass it to the GetConvolve*Algorithms() methods in the stream executor.

Compromise

Instead of passing the flag down to XLA, would it be acceptable to you if I replicated the RequireCuDNNDeterminism() code (which looks at both TF_CUDNN_DETERMINISTIC and TF_DETERMINISTIC_OPS) in tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc? There, the setting can be both used and passed down to the GetConvolve*Algorithms() methods in the stream executor.

duncanriach (Contributor, Author)

I just pushed a commit that removes the dependency of both core/kernels:gpu_utils and xla/service/gpu:gpu_conv_algorithm_picker on stream_executor. This has been done simply by replicating the code that listens for, and caches, TF_CUDNN_DETERMINISTIC || TF_DETERMINISTIC_OPS in both core/kernels/gpu_utils.cc and xla/service/gpu/gpu_conv_algorithm_picker.cc.
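
In essence, the replicated helper looks like the following (a sketch based on this discussion, assuming TensorFlow's ReadBoolFromEnvVar helper from tensorflow/core/util/env_var.h; the exact code in each file may differ):

```cpp
#include "tensorflow/core/util/env_var.h"

// Returns true if deterministic cuDNN functionality is required, i.e. if
// either TF_CUDNN_DETERMINISTIC or TF_DETERMINISTIC_OPS is set to a true
// value. The result is computed once and cached for the process lifetime.
bool RequireCudnnDeterminism() {
  static bool require_cudnn_determinism = [] {
    bool deterministic_ops = false;
    TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_DETERMINISTIC_OPS",
                                               /*default_val=*/false,
                                               &deterministic_ops));
    bool cudnn_deterministic = false;
    TF_CHECK_OK(tensorflow::ReadBoolFromEnvVar("TF_CUDNN_DETERMINISTIC",
                                               /*default_val=*/false,
                                               &cudnn_deterministic));
    return deterministic_ops || cudnn_deterministic;
  }();
  return require_cudnn_determinism;
}
```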

I considered passing a require_determinism flag through the parameter lists of the CudnnSupport::GetConvolve*Algorithms() methods (defined in stream_executor/cuda/cuda_dnn.cc), rather than generating the flag locally within cuda_dnn.cc, but realized that the flag was also needed for selecting the deterministic operation of cuDNN max-pooling, which is also done in cuda_dnn.cc. There was no easy way to propagate the flag in that case.

So currently, the flag is generated independently, and equivalently, in three different places. To prevent code repetition, I would prefer for it to be in a library that can be included by stream_executor, core, and XLA. I don't know what would be the right library for that, however. I originally incorrectly chose stream_executor/cuda:cuda_helpers because of my lack of understanding of the architecture of TensorFlow.

To re-summarize my intentions:

  1. I want to keep the API for determinism as simple as possible for users, which means not changing the meaning or functionality of TF_CUDNN_DETERMINISTIC and TF_DETERMINISTIC_OPS, and also extending the promise that they represent. This pull request represents a bug fix that requires those environment variables to directly control code in XLA and core/kernels.

  2. For this development work, I think it makes sense to make the changes wherever necessary to attain deterministic functionality as quickly as possible while maintaining a simple and consistent API. Going forward, this will mostly appear as different parts of the codebase listening directly to TF_DETERMINISTIC_OPS. Ultimately, either I or someone else can go through and properly plumb the switch into all the places that have been highlighted by the use of TF_DETERMINISTIC_OPS, possibly from something like tf.config.deterministic_ops.

duncanriach (Contributor, Author)

Assuming that it would be preferable to define RequireCudnnDeterminism() once, in one place, I've been looking for an appropriate place to do that. I'm wondering if tensorflow/core/common_runtime/gpu would make sense. I'm thinking that it could go into a new "module" (.h + .cc) called gpu_determinism. Would it be okay for stream_executor, core, and XLA to import and use this?

duncanriach (Contributor, Author)

I think I've found another (even more?) appropriate place to define RequireCudnnDeterminism(): tensorflow/core/util:use_cudnn (use_cudnn.h / use_cudnn.cc).

timshen91 (Member) commented Jan 7, 2020

@duncanriach , thanks for looking at these solutions!

I don't have a sense of priority for this PR, so I'm happy to defer the call to you. Depending on the priority, it's fine by me either to hold on this PR (as you seem to suggest) or to query the env var in multiple places, with comments describing the migration path (your original commits).

duncanriach (Contributor, Author)

If you're willing to take it, and it seems that you are, then I would much prefer to have a bug fix in place immediately, by querying the env var in multiple places (as specified by the current commits).

Please will you clarify what you mean by "comments describing the migration path"? Do you mean adding comments explaining the intention to migrate to tf.config.experimental.deterministic_ops and associated plumbing? If so, I would gladly add that.

Also, are you happy with having the code defined by RequireCudnnDeterminism() (and the migration-plan comment) replicated in three different places in the codebase, or would you prefer, as I would, for it to be defined in one place, such as tensorflow/core/common_runtime/gpu:gpu_determinism? Refactoring that would be an easy and quick change to make.

timshen91 (Member)

> If you're willing to take it, and it seems that you are, then I would much prefer to have a bug fix in place immediately, by querying the env var in multiple places (as specified by the current commits).

Yes, this is what I meant.

> Please will you clarify what you mean by "comments describing the migration path"? Do you mean adding comments explaining the intention to migrate to tf.config.experimental.deterministic_ops and associated plumbing? If so, I would gladly add that.

Yes, plus verbal warning bits like "this is a temporary solution".

> Also, are you happy with having the code defined by RequireCudnnDeterminism() replicated in three different places in the codebase, or would you prefer, as I would, for it to be defined in one place, such as tensorflow/core/common_runtime/gpu:gpu_determinism? Refactoring that would be an easy and quick change to make.

I prefer to duplicate them in several places, all with the comment we talked about above, and also with "this code is duplicated, and should be kept in sync with [other files]". I don't expect frequent changes to these duplicates, so keeping them in sync shouldn't be too much work.
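
For example, the kind of comment being discussed might look like this (wording illustrative, not the actual commit text):

```cpp
// TODO: This function is temporary; the plan is to replace the
// TF_DETERMINISTIC_OPS env var with a properly plumbed configuration
// mechanism (e.g. something like tf.config.deterministic_ops). Note that
// this code is duplicated in, and should be kept in sync with,
// tensorflow/core/kernels/gpu_utils.cc,
// tensorflow/compiler/xla/service/gpu/gpu_conv_algorithm_picker.cc, and
// tensorflow/stream_executor/cuda/cuda_dnn.cc.
```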

duncanriach (Contributor, Author)

Got it. Yes, it's very unlikely that the code will ever need to change (until it's removed). I'll work on an incremental commit to address these items. Thank you, Tim.

duncanriach (Contributor, Author)

@timshen91, the most recent commit adds a detailed comment to all replicas of the function.

PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Dec 9, 2019
gbaned (Contributor) commented Dec 17, 2019

@duncanriach Can you please check reviewer comments and keep us posted. Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Dec 17, 2019
duncanriach (Contributor, Author)

@gbaned, I have responded to the reviewer. Please remove stat:awaiting response.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Dec 21, 2019
@gbaned gbaned added the awaiting review Pull request awaiting review label Dec 24, 2019
duncanriach (Contributor, Author)

Changes pushed, and discussed here.

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Dec 28, 2019
duncanriach (Contributor, Author) commented Dec 30, 2019

@gbaned, please will you add the awaiting review tag again?

History:

  • 7 days ago: You added the awaiting review tag.
  • 3 days ago: I pushed some changes, with an explanation, intending to move this PR forward.
  • 2 days ago: The bot removed the awaiting review tag.
  • Today: I added another response.

@gbaned gbaned added the awaiting review Pull request awaiting review label Jan 3, 2020
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Jan 4, 2020
@gbaned gbaned added the awaiting review Pull request awaiting review label Jan 8, 2020
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Jan 16, 2020
PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Jan 17, 2020
@tensorflow-bot tensorflow-bot bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Jan 17, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jan 17, 2020
tensorflow-copybara pushed a commit that referenced this pull request Jan 27, 2020
…tic-cudnn-convolutions

PiperOrigin-RevId: 291684013
Change-Id: I818177de66eeec3dd52e276a5894a1d7a7166459
@tensorflow-copybara tensorflow-copybara merged commit 891416a into tensorflow:master Jan 27, 2020
PR Queue automation moved this from Approved by Reviewer to Merged Jan 27, 2020
akuegel (Member) commented Jan 27, 2020

This change had to be rolled back. It seems one of our test targets became flaky with this CL.

duncanriach (Contributor, Author) commented Jan 27, 2020

@akuegel, oh no! Could you tell me which test target became flaky?

(adding link to roll-back commit here)

sanjoy (Contributor) commented Jan 29, 2020

> @akuegel, oh no! Could you tell me which test target became flaky?

It is an internal target, but AFAICT the target was always flaky. I'm now following up with the team internally.

duncanriach (Contributor, Author)

Awesome. Thank you, @sanjoy!

akuegel (Member) commented Jan 30, 2020

I will try to roll forward again.

duncanriach (Contributor, Author)

Thank you, @akuegel.

ArmageddonKnight pushed a commit to UofT-EcoSystem/tensorflow that referenced this pull request Jan 31, 2020
…deterministic-cudnn-convolutions

PiperOrigin-RevId: 292501090
Change-Id: I31fd5aa4ed36c2929f1300250352781fca749f37
@duncanriach duncanriach deleted the multi-algorithm-deterministic-cudnn-convolutions branch February 11, 2020 22:05
@duncanriach duncanriach restored the multi-algorithm-deterministic-cudnn-convolutions branch February 11, 2020 22:06
@duncanriach duncanriach deleted the multi-algorithm-deterministic-cudnn-convolutions branch May 7, 2020 21:20
leiwuzheng commented Nov 20, 2021

Hi @duncanriach, I think your pull request introduced non-deterministic behavior in TensorFlow 2.5.0. For example, I launched TensorFlow twice to do inference on my convolution graph, using the following environment variables:

setenv CUDA_VISIBLE_DEVICES 0 ; setenv TF_CPP_MAX_VLOG_LEVEL 4 ; setenv TF_DUMP_GRAPH_PREFIX ; setenv TF_DETERMINISTIC_OPS 1 ; setenv TF_USE_DEFAULT_CONV_ALGO 0 ; setenv TF_CUDNN_USE_AUTOTUNE 1

In the first run, cuDNN autotuning gives the following convolution algorithms:
2021-11-21 04:07:25.562823: I tensorflow/core/kernels/conv_ops.cc:1306] Convolution Algorithm: 6
2021-11-21 04:07:25.966770: I tensorflow/core/kernels/conv_ops.cc:1306] Convolution Algorithm: 6
2021-11-21 04:07:26.366184: I tensorflow/core/kernels/conv_ops.cc:1306] Convolution Algorithm: 6
2021-11-21 04:07:26.407513: I tensorflow/core/kernels/conv_ops.cc:1306] Convolution Algorithm: 0
But in the second run, cuDNN autotuning gives:
2021-11-21 04:07:48.726334: I tensorflow/core/kernels/conv_ops.cc:1306] Convolution Algorithm: 6
2021-11-21 04:07:49.133102: I tensorflow/core/kernels/conv_ops.cc:1306] Convolution Algorithm: 6
2021-11-21 04:07:49.533022: I tensorflow/core/kernels/conv_ops.cc:1306] Convolution Algorithm: 6
2021-11-21 04:07:49.574372: I tensorflow/core/kernels/conv_ops.cc:1306] Convolution Algorithm: 1
These two runs then generate different inference results, because different convolution algorithms are used.

Do you happen to have a workaround to avoid this non-deterministic behavior? Would turning off autotuning via TF_CUDNN_USE_AUTOTUNE and relying on cudnnGetConvolutionForwardAlgorithm_v7 to select the convolution algorithm be a feasible solution? Thanks in advance.
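
(For reference, cudnnGetConvolutionForwardAlgorithm_v7 returns candidate algorithms ranked by cuDNN's heuristics, and each cudnnConvolutionFwdAlgoPerf_t entry carries a determinism field, so its output can in principle be filtered for a deterministic algorithm. A rough sketch of that idea, with error handling kept minimal and the tensor/filter/convolution descriptors assumed to be created and configured by the caller:)

```cpp
#include <cudnn.h>

// Sketch: ask cuDNN's heuristics for ranked forward-convolution algorithms
// and return the first one that is flagged as deterministic.
cudnnStatus_t FirstDeterministicFwdAlgo(
    cudnnHandle_t handle, cudnnTensorDescriptor_t x_desc,
    cudnnFilterDescriptor_t w_desc, cudnnConvolutionDescriptor_t conv_desc,
    cudnnTensorDescriptor_t y_desc, cudnnConvolutionFwdAlgo_t* algo_out) {
  cudnnConvolutionFwdAlgoPerf_t perf[CUDNN_CONVOLUTION_FWD_ALGO_COUNT];
  int returned = 0;
  cudnnStatus_t status = cudnnGetConvolutionForwardAlgorithm_v7(
      handle, x_desc, w_desc, conv_desc, y_desc,
      CUDNN_CONVOLUTION_FWD_ALGO_COUNT, &returned, perf);
  if (status != CUDNN_STATUS_SUCCESS) return status;
  for (int i = 0; i < returned; ++i) {
    if (perf[i].status == CUDNN_STATUS_SUCCESS &&
        perf[i].determinism == CUDNN_DETERMINISTIC) {
      *algo_out = perf[i].algo;
      return CUDNN_STATUS_SUCCESS;
    }
  }
  return CUDNN_STATUS_NOT_SUPPORTED;  // no deterministic algorithm available
}
```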

duncanriach (Contributor, Author) commented Nov 22, 2021

Hi @Leiwu-Zheng, I believe that the code to deterministically select cuDNN convolution algorithms has evolved significantly since this pull request. @kaixih, please could you comment on this?

kaixih (Contributor) commented Nov 23, 2021

Is this for the TF2 code? For the cuDNN frontend API, we will use the first working deterministic engine when TF_CUDNN_DETERMINISTIC=1, so it should be able to guarantee both in-run and run-to-run determinism. For the legacy cuDNN API (as with the algorithms shown in #34951 (comment)), I believe TF_CUDNN_DETERMINISTIC=1 can only ensure in-run determinism, since it will use a list of deterministic algorithms but it is not certain which one will be used. (I might be missing something, since I am not very clear about the context of this thread.)

duncanriach (Contributor, Author) commented Nov 24, 2021

@Leiwu-Zheng, @kaixih and I are working on this together. We'll get back to you, but it will probably be in the new year (2022).

CCing @reedwm as well. Reed, my understanding is that @Leiwu-Zheng is setting TF_DETERMINISTIC_OPS to "1" or "true" in TF2 and getting nondeterministic selection of deterministic algorithms.

Wait, what version of TensorFlow are you using, @Leiwu-Zheng? According to my notes, in version 2.0, you'll need to use TF_CUDNN_DETERMINISTIC, but in version 2.1, and later, you'll need to use TF_DETERMINISTIC_OPS.

Actually, scratch all of that. @Leiwu-Zheng, please will you open an issue with a clear and well-contained reproducer of the problem you're witnessing.

leiwuzheng commented Nov 24, 2021

@duncanriach, thanks for the investigation. I have updated my previous comment to make it clearer. I will open an issue. Currently, I am looking for a workaround so that version 2.5.0 can generate deterministic results for both training and inference, but TF_USE_DEFAULT_CONV_ALGO=1 only works for inference.
@kaixih, I noticed that in the version 2.5.0 code, TF_CUDNN_USE_FRONTEND is turned off by default. Does that mean TF_CUDNN_USE_FRONTEND=1 could be a workaround for version 2.5.0 to get in-run and run-to-run determinism?

Update:
I just launched TensorFlow 2.5.0 ten times to do inference with the following settings, and it always generates the same result.
setenv CUDA_VISIBLE_DEVICES 0 ; setenv TF_CPP_MAX_VLOG_LEVEL 0 ; setenv TF_DUMP_GRAPH_PREFIX ; setenv TF_DETERMINISTIC_OPS 1 ; setenv TF_USE_DEFAULT_CONV_ALGO 0 ; setenv TF_CUDNN_USE_AUTOTUNE 1 ; setenv TF_CUDNN_USE_FRONTEND 1
@duncanriach, do we still need to open a bug issue?

duncanriach (Contributor, Author) commented Nov 25, 2021

@Leiwu-Zheng, thank you for looking into this more deeply and for discovering this work-around: TF_CUDNN_USE_FRONTEND = "1".

cuDNN convolution algorithm selection should be deterministic when deterministic ops are enabled (via TF_DETERMINISTIC_OPS or, in version 2.8 onwards, tf.config.experimental.enable_op_determinism) without having to enable the cuDNN front-end. Even with the legacy cuDNN API, convolution algorithm selection should be deterministic when deterministic ops are enabled.

I think this is a bug.

Please will you confirm that this issue exists in the latest release (version 2.7; if possible) and open an issue (after attempting to confirm that one does not already exist for this). Provide a simple-as-possible, well-contained reproducer that demonstrates this issue on a specific version of TensorFlow. Please tag me, @reedwm, and @kaixih in that new issue, and reference this discussion.

Thank you for doing all this work, Leiwu.
